CS7641 - Machine Learning

Videos: https://omscs.gatech.edu/cs-7641-machine-learning-course-videos

Supervised Learning (SL)

Definition: Take examples of inputs and outputs. Now, given a new input, predict its output.

Classification vs Regression

  • Regression: If output is continuous
  • Classification: If output is a small, discrete set

Classification Learning

Terms

| Term | Definition |
| --- | --- |
| Instances | Input. Vectors of attributes that define your input space. |
| Concept | Function. Maps the input space to outputs. A concept is a mapping between objects in a world and membership in a set, which makes it a function. |
| Target Concept | The correct concept that maps instances to outputs. |
| Hypothesis Class | Set of concepts that you're willing to consider in order to find the target concept (e.g. binary or multiclass, but not considering regression output). |
| Sample | Training set. Set of all of our inputs paired with the correct output labels. Foundation for inductive learning. |
| Candidate Concept | Concept that you think is the target concept (e.g. "anyone with curly hair is credit-worthy"). |
| Testing Set | Set of instances used to test a candidate concept. |

SL1: Decision Trees

Representation vs Algorithm

  • Decision trees are a specific representation.
  • The algorithm is the method by which we build a decision tree.

Quiz: Running a decision tree

Work through each instance and give an answer based on the decision tree.

img

Note:

  • These examples are part of a test set.
  • The tree in the image is our candidate concept.
  • Some attributes don't matter and are thus never used.

Decision Tree Algorithm

"20 questions" example: We want to narrow the possibilities with each question asked. This motivates asking very general questions that significantly narrow down the possibilities.

Learning Algorithm

  1. Pick the "best" attribute. "Best" is the attribute that splits the training set roughly in half.
  2. Ask question
  3. Follow the answer path.
  4. Go to 1, until you get to an answer.

Quiz: Picking the "best" attribute

img

Notes:

  • Candidates 1 & 2 may be equally bad: 1 creates pretty much the same distribution after the split, and 2 doesn't split at all.

Expressiveness

Example: Boolean expressions as decision trees

img

Expressiveness per Tree Types

img

| Type | Size (nodes) | Description |
| --- | --- | --- |
| n-OR ("any" fn) | Linear in $n$ | Easy to solve. |
| n-XOR ("odd/even parity" fn)* | $O(2^{n})$ | Hard to solve. No clever way to pick the right attribute (you need to look at all attributes). |

Takeaway: We hope that our ML problems are more like n-OR problems rather than n-XOR problems because n-XOR problems are hard and expensive.

* If number of attributes that are true is odd, then the output is true.

Expressiveness Continued

img

Q: Exactly how expressive is a decision tree?

Given:

  • n attributes (boolean) - $O(n!)$ (combinatorial) ways to build nodes
  • outputs is boolean
  • Q1: If mapped to a truth table, how many rows are needed?
  • A1: $2^n$ rows
  • Q2: How many ways are there to fill in the outputs (outputs are boolean)?
  • A2: $2^{2^n}$ (really big number, grows very fast)

A: The hypothesis space of decision trees is very expressive. We need to search this space cleverly to find the best representation efficiently.

ID3 Algorithm

Loop:
    1. Pick best attribute A
    2. Assign A as decision attribute for Node
    3. For each value of A, create a descendant of Node
    4. Sort training examples to leaves
    5. Stop if examples perfectly classified
    6. Else iterate over leaves
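The loop above can be sketched as a recursive function. This is a minimal version for discrete attributes, assuming examples are (attribute-dict, label) pairs — a representation I chose for illustration, not from the lectures:

```python
import math
from collections import Counter

def entropy(labels):
    """-sum_v p(v) log2 p(v) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    """Entropy of the whole set minus weighted entropy of each subset."""
    g = entropy([y for _, y in examples])
    for v in set(x[attr] for x, _ in examples):
        subset = [y for x, y in examples if x[attr] == v]
        g -= len(subset) / len(examples) * entropy(subset)
    return g

def id3(examples, attrs):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                 # 5. examples perfectly classified
        return labels[0]
    if not attrs:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))   # 1. pick best attribute
    branches = {}
    for v in set(x[best] for x, _ in examples):          # 3. descendant per value
        subset = [(x, y) for x, y in examples if x[best] == v]  # 4. sort examples
        branches[v] = id3(subset, [a for a in attrs if a != best])  # 6. iterate
    return (best, branches)

# AND of two boolean attributes as training data
examples = [({'a': 0, 'b': 0}, 0), ({'a': 0, 'b': 1}, 0),
            ({'a': 1, 'b': 0}, 0), ({'a': 1, 'b': 1}, 1)]
tree = id3(examples, ['a', 'b'])
```

Internal nodes come out as (attribute, branches) tuples and leaves as labels, so the AND data yields a small two-level tree.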

Using Information Gain to Choose Best Attribute

img

Definition: The reduction in randomness of the labels over your set of data, based on knowing the value of a particular attribute.

Entropy: "Average amount of information conveyed by an event, when considering all possible outcomes"

  • Intuition: Measure of the amount of information. Reducing entropy means we are gaining more information. Goal is to maximize information gain by choosing attribute that lowers entropy the most.
  • Example: Fair vs. two-headed coin. A fair coin has a 50/50 chance of heads, giving 1 bit of entropy. A two-headed coin has a 100% chance of heads, giving zero entropy (no information).
    • The more one label dominates another, the lower the entropy.

Formula:

  • $S$: Collection of training examples
  • $A$: Specific attribute
  • $v$: Specific values of an attribute (categorical, continuous)
$$ Gain(S,A) = Entropy(S) - \Sigma_v\frac{|S_v|}{|S|} Entropy(S_v) $$

Entropy formula: $$ -\Sigma_v p(v) \log p(v) $$
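A quick numeric check of the entropy formula on the coin example above (base-2 log, so the units are bits):

```python
import math
from collections import Counter

def entropy(labels):
    """-sum_v p(v) log2 p(v), with p(v) estimated from label counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

fair = entropy(['H', 'T'])           # 50/50 coin: 1 bit of entropy
rigged = entropy(['H', 'H'])         # always heads: zero information
skewed = entropy(['H'] * 3 + ['T'])  # mostly one label: entropy drops below 1
```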

ID3: Inductive Bias

Types of bias:

  1. Restriction bias: Hypothesis set that you care about.
    • For DT, includes all possible DT representations.
    • Does NOT include functions outside of DT (quadratic, etc).
  2. Preference bias: What sorts of hypotheses in the hypothesis set ($h \in H$) we prefer.
    • Inductive bias == preference bias

ID3's inductive bias - Types of preferences:

  • Prefers good splits near the top.
  • Prefers trees that return the correct answer over those that return incorrect ones.
  • Prefers shorter trees to longer trees. This comes naturally from preferring good splits near the top.

Decision Trees: Other Considerations

img

Continuous Attributes

Q: How to build DTs with continuous attributes?

  • A: Create a binary branching decision based on continuous ranges.

Q: T/F, does it make sense to repeat an attribute along a path in a tree?

  • A1: False for discrete value attributes because there is no information gain asking the same question.
  • A2: True for continuous attributes, because we can ask different questions about the same attribute; nodes further down the branch can split on different ranges.

When to stop

Overfitting: Don't trust the data too much, because it may contain noise. Overfitting tends to happen when trees get too big.

Solutions

  • Validation: Use cross-validation, or use held out validation set to verify if we want to continue branching.
  • Pruning: Build the full tree first, and then prune it back based on how much error the pruning costs (based on a validation set).
    • The pruned leaf's output is the mode of its examples' labels.

How to build DT using regression?

Regression implies continuous outputs. Information gain doesn't work well with continuous values.

Q: What's the splitting criteria for regression trees?

  • A: You can split by variance to decide.

Q: What's the output?

  • A: You can use the average
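Both answers in one sketch: split a single continuous attribute where the size-weighted variance of the children is lowest, and output the leaf average. The data and helper names here are made up for illustration:

```python
def variance(ys):
    """Mean squared deviation from the mean; 0 for an empty or constant leaf."""
    if not ys:
        return 0.0
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Threshold on one continuous attribute that most reduces the
    size-weighted variance of the two children."""
    best_t, best_score = None, variance(ys)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# two clusters of targets: the split should land between them
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
t = best_split(xs, ys)
left = [y for x, y in zip(xs, ys) if x <= t]
leaf_output = sum(left) / len(left)   # the leaf predicts the average
```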

Summary

img

SL2: Regression

Regression & Function Approximation

img

Etymology: "Regression to the mean"

  • A slope of less than 1 (e.g. 2/3) shows up in many natural phenomena, like parent/child height relations.
  • From that origin, "regression" in ML has come to mean using a functional form to approximate a set of data points.

Regression in ML

Finding the best line (measured by least squared error) is achieved through calculus (i.e. setting the derivative to zero).

Finding the best constant function

img

  • The image above shows how to derive the best constant function for the sum of squared errors. Setting the derivative to zero gives the minimum.
  • In this case, the best constant is the average of all the $y$ values.
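The claim that the average minimizes squared error can be checked numerically; the data values below are arbitrary:

```python
ys = [2.0, 4.0, 9.0]

def sse(c, ys):
    """Sum of squared errors if we always predict the constant c."""
    return sum((y - c) ** 2 for y in ys)

mean = sum(ys) / len(ys)   # (2 + 4 + 9) / 3 = 5.0
# sweep constants around the mean: none does better than the mean itself
candidates = [mean + step / 10 for step in range(-20, 21)]
best = min(candidates, key=lambda c: sse(c, ys))
```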

Polynomial Regression

Choosing the Order of Polynomial

img

  • The greater the order of polynomial used, the closer we can fit the data at the cost of sensitivity to noise (ie. overfitting).

Math: Polynomial Regression

img

img

  • Each data point $x$ is raised to every power up to the polynomial's degree; these values form one row of the matrix.
  • This matrix is then multiplied by the vector of coefficients, one coefficient per exponent.
  • Each row maps to one $y$ response.
  • $w$ is the vector of coefficient weights that gives the best fit.
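The matrix construction above can be written out directly. This sketch uses the standard normal-equations identity $w = (X^TX)^{-1}X^Ty$, solved with plain Gaussian elimination; the data is generated from a known quadratic so the fit can be checked:

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations,
    where X[i][j] = xs[i] ** j (one row per data point)."""
    X = [[x ** j for j in range(degree + 1)] for x in xs]
    m = degree + 1
    # A = X^T X, b = X^T y
    A = [[sum(X[i][r] * X[i][c] for i in range(len(xs))) for c in range(m)]
         for r in range(m)]
    b = [sum(X[i][r] * ys[i] for i in range(len(xs))) for r in range(m)]
    # solve A w = b by Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * m
    for r in range(m - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, m))) / A[r][r]
    return w

# data generated from y = 1 + 2x + 3x^2: the fit should recover [1, 2, 3]
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]
w = polyfit(xs, ys, 2)
```

On noise-free data the fit recovers the generating coefficients almost exactly; with noisy data it returns the least-squares compromise instead.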

Errors, and where they come from

Training data contains error, so it doesn't model $f$ exactly; it models $f+\epsilon$.

Types of error sources:

  1. Sensor errors
  2. Malicious data
  3. Transcription error
  4. Unmodeled influences that are hard to capture (e.g. fluctuations in housing prices, time of day, buyer's mood)

Cross Validation

Definition: Split the training data into n folds, then use one fold as the "test" data while training on the rest. Repeat with each fold. Choose the model with the lowest average error.
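The definition above as a sketch. To keep it self-contained, the "model" is just a constant that predicts the mean of its training folds — that modeling choice is mine, not from the lecture:

```python
def k_fold_error(ys, k):
    """Average held-out squared error over k folds, for a model that
    always predicts the mean of its training folds."""
    folds = [ys[i::k] for i in range(k)]          # deal the data into k folds
    total, count = 0.0, 0
    for i in range(k):
        train = [y for j, f in enumerate(folds) if j != i for y in f]
        test = folds[i]
        prediction = sum(train) / len(train)      # "fit" on the other folds
        total += sum((y - prediction) ** 2 for y in test)
        count += len(test)
    return total / count

err = k_fold_error([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3)
```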

Aside - Assumptions of SL:

  • Train/test sets are IID (independent and identically distributed)

Applied to housing example in video:

  1. Gap between train/cross-val error narrows up to 3-4th degree
  2. Higher order polynomials begin overfitting to the training data at the expense of generalization on unseen examples.

img

Other Input Spaces

img

  • Regression can handle discrete, vector, or scalar attributes.
  • Handling categorical attributes: enumerate them (OK), or 1-hot encode them (better), because enumeration gives the categories an artificial rank.
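The enumerate-vs-1-hot point in code: enumeration would place "blue" numerically above "green", while the 1-hot indicator vector imposes no order. The category list is illustrative:

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]

categories = ['red', 'green', 'blue']
encoded = one_hot('green', categories)   # [0, 1, 0]
# enumeration instead would map red->0, green->1, blue->2,
# which falsely implies blue > green > red to a regression model
```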

Summary

img

SL3: Neural Networks

video

Example of a perceptron (binary classifier with 3 inputs + weights)

img

Quiz: Example inputs and estimated y

img

How powerful is a perceptron unit?

Image: Representation of activation space given 2 inputs + weights. Generates a halfplane.

img

Takeaway: Perceptrons are linear functions that compute a hyperplane.

Perceptron Quizzes

Quiz 1: If we focus on $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$, what is $y$?

  • A: AND Logic

Quiz 2: If $y$ is an OR function, what would we set $w_1$, $w_2$, $\theta$?

  • A: (Many answers) $w_1 = 0.5$, $w_2 = 0.5$, $\theta = 0.5$; compared to AND, this lowers the threshold so a single active input is enough to fire.

Quiz 3: Given a NOT logic (one variable), what should we set $w_1$ and $\theta$?

  • A: $w_1 = -1$ (maps input 0 to activation 0, and input 1 to activation -1), and $\theta = 0$, so activation 0 meets the threshold (output 1) while -1 falls below it (output 0).

Takeaway: AND/OR/NOT logic can each be expressed as a single perceptron unit; XOR requires a network of perceptron units.
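The quiz answers as code. The OR and NOT weights/thresholds come from the quizzes above; the AND weights are one valid choice I picked (quiz 1 gives its setup in an image):

```python
def perceptron(xs, ws, theta):
    """Fire (1) iff the weighted sum of inputs meets the threshold."""
    return 1 if sum(x * w for x, w in zip(xs, ws)) >= theta else 0

AND = lambda x1, x2: perceptron([x1, x2], [0.5, 0.5], theta=1.0)
OR  = lambda x1, x2: perceptron([x1, x2], [0.5, 0.5], theta=0.5)
NOT = lambda x1:     perceptron([x1], [-1.0], theta=0.0)

# check all rows of each truth table
for x1 in (0, 1):
    for x2 in (0, 1):
        assert AND(x1, x2) == (x1 and x2)
        assert OR(x1, x2) == (x1 or x2)
    assert NOT(x1) == (1 - x1)
```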

XOR as Perceptron Network Quiz

img

In [16]:
# Perceptron Quiz answer in code

def xor_network(x1, x2, w1, w2, w_and, theta) -> int:
    activation = x1*w1 + x2*w2 + bool(x1 and x2)*w_and
    return 1 if activation >= theta else 0

def test_xor(w1, w2, w_and, theta): 
    assert xor_network(0,0, w1, w2, w_and, theta) == 0
    assert xor_network(0,1, w1, w2, w_and, theta) == 1
    assert xor_network(1,0, w1, w2, w_and, theta) == 1
    assert xor_network(1,1, w1, w2, w_and, theta) == 0
    print("Passed all tests")

w_1 = 1
w_2 = 1
w_and = -2
threshold = 1   
test_xor(w_1, w_2, w_and, threshold)
Passed all tests

Answer to XOR question:

  • We essentially want the network to work like OR logic, except when AND is 1. The previous quiz showed that OR logic can be created by setting $w_1$ and $w_2$ to 1 and $\theta$ to 1.
  • We can therefore add a negative weight by setting $w_\text{and}$ to -2, which offsets $x_1w_1 + x_2w_2$ when both inputs are on and gives an activation of 0. img

Perceptron Training

Definition: Given examples, find weights that map inputs to outputs.

2 ways in which perceptron training is done:

  1. Perceptron Rule: Using the threshold ($\theta$)
  2. Gradient Descent / Delta Rule: Using the "unthreshold" value.

Perceptron Rule

Goal: How to set the weights of a single unit so that it matches a training set.

img

Components:

  • $x$: input (single unit for this example)
  • $w_i$: Weight of $i$-th unit
  • $y$: Target
  • $\hat{y}$: Prediction
  • $\eta$: Learning rate. Controls the size of each weight update.
  • $\theta$: Folded away with a math trick: add a constant bias input of 1 with its own weight, so we don't have to treat the threshold separately.

(1) Weight change formula: Changing weight of $w_i$ by $\Delta w_i$. This is what we iterate on while there is error.

$$ w_i = w_i + \Delta w_i $$

(2) Defining weight change amount - First, calculate $\hat{y}$.

  • If y and $\hat{y}$ are equal, then we don't need to change the weight.
  • If they are not equal, then we get either a positive or negative value.
$$ \hat{y} = (\Sigma_i w_i x_i \ge 0) $$

(3) The weight change is determined by the distance between $y$ and $\hat{y}$, multiplied by the input value $x_i$ and scaled by the learning rate $\eta$.

$$ \Delta w_i = \eta (y-\hat{y})x_i $$
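The three formulas combined into a training loop, learning AND with the bias trick (a constant 1 appended to each input in place of $\theta$); the learning rate and epoch cap are arbitrary choices of mine:

```python
def train_perceptron(data, eta=0.1, max_epochs=100):
    """Perceptron rule: w_i += eta * (y - y_hat) * x_i, repeated while errors remain.
    Each x includes a constant 1 as its last component (bias trick)."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(max_epochs):
        errors = 0
        for x, y in data:
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            if y_hat != y:
                errors += 1
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, x)]
        if errors == 0:                  # linearly separable: converged
            return w
    return w                             # hit the iteration limit

# AND, with the bias input appended to each example
data = [([0, 0, 1], 0), ([0, 1, 1], 0), ([1, 0, 1], 0), ([1, 1, 1], 1)]
w = train_perceptron(data)
```

Because AND is linearly separable, the loop finds a separating weight vector in a finite number of epochs, as the convergence guarantee promises.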

Linearly Separable & Perceptron Rule

Definition: All examples can be separated by a linear hyperplane. This gets more complicated to visualize with 4+ dimensions.

Perceptron Rule: If we have data that is linearly separable, then the perceptron rule will find this hyperplane.

Q: What if the data is not linearly separable?

  • A: Continue until iteration limit is reached (not easy to figure out).

Gradient Descent

Motivation: More robust to linearly non-separable datasets.

  • Key is to not use thresholding.

img

Components:

  • $\alpha$: Activation, the sum of $x_i$ and $w_i$
  • $y$: Target
  • $\hat{y}$: Estimated output, ${\alpha \ge 0}$
  • $E(w)$: Error metric

(1) Calculate the error metric. Goal is to minimize this error.

  • $\frac{1}{2}$ is just a helper to make deriving the partial derivative easier.
$$ E(w) = \frac{1}{2} \Sigma_{(x,y)\in D} (y-\alpha)^2 $$

(2) Find the partial derivative of above.

  • Why? We do this to find the best weights to minimize the error.
$$ \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \Sigma_{(x,y)\in D}(y-\alpha)^2 $$

(3) We get the result - The derivative of the error $E$ with respect to $w_i$. The result looks pretty similar to the perceptron rule.

$$ \frac{\partial E}{\partial w_i} = \Sigma_{(x,y) \in D} (y-\alpha)(-x_i) $$
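Stepping against this derivative gives the update $\Delta w_i = \eta\,\Sigma_{(x,y)\in D}(y-\alpha)x_i$. A sketch of batch gradient descent on noise-free linear data of my own making:

```python
def gradient_descent(data, eta=0.05, epochs=500):
    """Delta rule: w_i += eta * sum over D of (y - a) * x_i,
    where a is the unthresholded activation sum_i w_i x_i."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        # batch gradient: accumulate (y - a) * x_i over the whole dataset
        grad = [0.0] * n
        for x, y in data:
            a = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                grad[i] += (y - a) * x[i]
        w = [wi + eta * g for wi, g in zip(w, grad)]
    return w

# targets generated from y = 2*x1 - 1*x2; descent should approach [2, -1]
data = [([x1, x2], 2 * x1 - x2) for x1 in (0.0, 0.5, 1.0) for x2 in (0.0, 0.5, 1.0)]
w = gradient_descent(data)
```

Note there is no threshold anywhere in the update: the error is measured against the raw activation $\alpha$, which is what makes the objective differentiable.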

Comparison of learning rules

| Type | Learning Rule | Pros | Cons |
| --- | --- | --- | --- |
| Perceptron Rule | $\Delta w_i = \eta (y-\hat{y})x_i$ | Guaranteed finite convergence | Requires linear separability |
| Gradient Descent | $\Delta w_i = \eta (y-\alpha)x_i$ | More robust to non-linearly separable data | Only guaranteed to converge to a local optimum |

Quiz: Why not do gradient descent on $\hat{y}$?

  • Answer: Because $\hat{y}$ is a discontinuous step function of the weights (it jumps between 0 and 1), it is non-differentiable. Example of how this looks on a 2D plot:

img

SL4: Instance Based Learning

SL5: Ensemble Bagging & Boosting

SL6: Kernel Methods & SVMs

SL7: Computational Learning Theory

SL8: VC Dimensions

SL9: Bayesian Learning

SL10: Bayesian Inference

End of Midterm 1 material


UL1: Randomized Optimization

UL2: Clustering
